Distance Analysis for Large - Scale Chip Multiprocessors

نویسندگان

  • Meng-Ju Wu
  • Donald Yeung
چکیده

Title of dissertation: REUSE DISTANCE ANALYSIS FOR LARGE-SCALE CHIP MULTIPROCESSORS Meng-Ju Wu, Doctor of Philosophy, 2012 Dissertation directed by: Professor Donald Yeung Department of Electrical and Computer Engineering Multicore Reuse Distance (RD) analysis is a powerful tool that can potentially provide a parallel program’s detailed memory behavior. Concurrent Reuse Distance (CRD) and Private-stack Reuse Distance (PRD) measure RD across threadinterleaved memory reference streams, addressing shared and private caches. Sensitivity to memory interleaving makes CRD and PRD profiles architecture dependent, preventing them from analyzing different processor configurations. However such instability is minimal when all threads exhibit similar data-locality patterns. For loop-based parallel programs, interleaving threads are symmetric. CRD and PRD profiles are stable across cache size scaling, and exhibit predictable coherent movement across core count scaling. Hence, multicore RD analysis can provide accurate analysis for different processor configurations. Due to the prevalence of parallel loops, RD analysis will be valuable to multicore designers. This dissertation uses RD analysis to analyze multicore cache performance for loop-based parallel programs. First, we study the impacts of core count scaling and problem size scaling on CRD and PRD profiles. Two application parameters with architectural implications are identified: Ccore and Cshare. Core count scaling only impacts cache performance significantly below Ccore in shared caches, and Cshare is the capacity at which shared caches begin to outperform private caches in terms of data locality. Then, we develop techniques, in particular employing reference groups, to predict the coherent movement of CRD and PRD profiles due to scaling, and achieve accuracy of 80%–96%. After comparing our prediction techniques against profile sampling, we find that the prediction achieves higher speedup and accuracy, especially when the design space is large. Moreover, we evaluate the accuracy of using CRD and PRD profile predictions to estimate multicore cache performance, especially MPKI. When combined with the existing problem scaling prediction, our techniques can predict shared LLC (private L2 cache) MPKI to within 12% (14%) of simulation across 1,728 (1,440) configurations using only 36 measured CRD (PRD) profiles. Lastly, we propose a new framework based on RD analysis to optimize multicore cache hierarchies. Our study not only reveals several new insights, but it also demonstrates that RD analysis can help computer architects improve multicore designs. Reuse Distance Analysis for Large-Scale Chip Multiprocessors

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

C-AMTE: A location mechanism for flexible cache management in chip multiprocessors

This paper describes Constrained Associative-Mapping-of-Tracking-Entries (C-AMTE), a scalable mechanism to facilitate flexible and efficient distributed cache management in large-scale chip multiprocessors (CMPs). C-AMTE enables fast locating of cache blocks in CMP cache schemes that employ one-to-one or one-to-many associative mappings. C-AMTE stores in percore data structures tracking entries...

متن کامل

Characterization of a List-Based Directory Cache Coherence Protocol for Manycore CMPs

The development of efficient and scalable cache coherence protocols is a key aspect in the design of manycore chip multiprocessors. In this work, we review a kind of cache coherence protocols that, despite having been already implemented in the 90s for building large-scale commodity multiprocessors, have not been seriously considered in the current context of chip multiprocessors. In particular...

متن کامل

Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?

On Chip Multiprocessors (CMP), it is common that multiple cores share certain levels of cache. The sharing increases the contention in cache and memory-to-chip bandwidth, further highlighting the importance of data locality analysis. As a rigorous and hardware-independent locality metric, reuse distance has served for a variety of locality analysis, program transformations, and performance pred...

متن کامل

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

As compared to a complex single processor based system, on-chip multiprocessors are less complex, more power efficient, and easier to test and validate. In this work, we focus on an on-chip multiprocessor where each processor has a local memory (or cache). We demonstrate that, in such an architecture, allowing each processor to do off-chip memory requests on behalf of other processors can impro...

متن کامل

PARSEC 2.0: A New Benchmark Suite for Chip-Multiprocessors

The second version of the Princeton Application Repository for Shared-Memory Computers (PARSEC) has been released. PARSEC is a benchmark suite for Chip-Multiprocessors (CMPs) that focuses on emerging applications. It includes a diverse set of workloads from different domains such as interactive animation or systems applications that mimic large-scale commercial workloads. The next version of PA...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012